Swedish patent SE1150966A1: Digital signal processor and baseband communication device

Abstract:
For increased efficiency, a digital signal processor (200) comprises a vector execution unit (203, 205) arranged to execute instructions that are to be performed on multiple data in the form of a vector, comprising a vector controller (720) arranged to determine if an instruction is a vector instruction and, if it is, inform a count register (732) arranged to hold the vector length, said vector controller being further arranged to receive an issue signal and control the execution of instructions based on this issue signal, said vector execution unit being characterized in that it comprises a local queue (730) arranged to receive instructions from a program memory and to hold them in the local queue until a predefined condition is fulfilled, and that the vector controller comprises queue control means (732, 721) arranged to control the local queue.
Publication number: SE1150966A1
Application number: SE1150966
Filing date: 2011-10-18
Publication date: 2013-04-19
Inventor: Anders Nilsson
Applicant: Mediatek Sweden AB
Main IPC class:

Description:

The processor core has a program memory for distributing instructions to the execution units. In WO 2007/018467, each of the vector execution units has a separate instruction decoder. This enables the vector execution units to be used independently of each other and of other parts of the processor in an efficient manner.
The instruction set architecture for the SIMT processor can typically include three classes of composite instructions.
- RISC instructions, which operate on 16-bit integer operands. The RISC instruction class includes most of the control-oriented instructions and can be executed within the integer execution unit of the processor core.
- DSP instructions, which operate on complex value data having a real part and an imaginary part. The DSP instructions can be executed on one or more of the SIMD clusters.
- Vector instructions. Vector instructions can be considered as extensions of the DSP instructions, as they operate on large amounts of data and can utilize advanced address modes and vector support.
The SIMT architecture therefore offers the performance of both task-level parallelism and SIMD vector computation, together with sufficient RISC control flexibility.
In a SIMT architecture there are therefore many execution units. Normally, one instruction can be issued from the program memory to one of the execution units in each clock cycle. Since vector operations typically operate on large vectors, an instruction received by a vector execution unit in one clock cycle will take a number of clock cycles to process. In the subsequent clock cycles, instructions can therefore be issued to other execution units of the processor. Because vector instructions run on long vectors, many RISC instructions can be executed concurrently with a vector operation.
Many baseband algorithms can be decomposed into chains of smaller baseband tasks with few backward dependencies between the tasks. This property not only allows different tasks to be performed in parallel on vector execution units, but can also be exploited using the instruction set architecture described above.
To synchronize the control flow and to control the data flow, "idle instructions" can often be used to halt the control flow until a given vector operation is completed. The "idle instruction" stops further instruction fetching until a certain condition is met. Such a condition may be the completion of a vector instruction in a vector execution unit.
As will be discussed in more detail below, a DSP task will typically include a sequence of one to ten instructions. This means that the vector execution unit will receive a vector instruction, for example to perform a calculation, and execute it on the supplied data vector until the entire vector has been processed. The next instruction will be to process the result and store it in memory, which can in theory happen immediately after the calculation has been performed on the entire vector. Often, however, a vector execution unit has to wait several clock cycles for its next instruction from the program memory because the processor core is busy waiting for other vector units to finish, resulting in inefficient use of the vector execution unit.
The probability that a vector execution unit is kept idle increases with the number of vector execution units in the system.
Summary of the Invention
An object of the present invention is to make the processing of vector instructions in a SIMT architecture more efficient.
This object is achieved according to the present invention by a vector execution unit for use in a digital signal processor, the vector execution unit being arranged to execute instructions, including vector instructions that are to be executed on multiple data in the form of a vector, and comprising a vector controller arranged to determine whether an instruction is a vector instruction and, if it is, to inform a count register arranged to hold the vector length, the vector controller further being arranged to control the execution of instructions, the vector execution unit being characterized in that it comprises a local queue arranged to receive at least a first and a second instruction from a program memory and to hold the second instruction in the local queue until a predetermined condition is met, and in that the vector controller comprises queue control means arranged to control the local queue.
Preferably, the vector controller controls the execution of instructions based on an issue signal received from the core. Alternatively, the issue signal can be generated locally by the vector execution unit itself.
Through the local queue provided for each vector execution unit, a set of instructions, including several vector instructions, can be provided to the vector unit at one time. To enable synchronization of the instructions in the local queue with the execution of vector instructions, an instruction called the SYNC instruction is provided, which pauses the reading of instructions from the local queue until a condition is met, typically that the data path is ready to receive and execute another instruction. These two features together allow a sequence of instructions to be sent to the vector execution unit at once, stored in the local queue and processed sequentially in the vector execution unit, so that as soon as the vector execution unit finishes one instruction it can start on the next. In this way, each vector execution unit can operate with a minimum of idle time.
The processing according to the invention consequently becomes more efficient by increasing the parallelism in the processor, since the vector execution units can operate more independently of each other. The invention is based on the insight that in the prior art a vector execution unit often cannot receive its next instruction immediately, since all vector execution units receive their commands from the same queue, i.e., the program memory in the processor core. This occurs when a vector execution unit is ready to receive a new command while the first command in the program is intended for another vector execution unit, which is busy. In this case, no vector execution unit can receive a new command until the other vector execution unit is ready to receive its next command.
In a preferred embodiment, the vector execution unit further comprises:
- An instruction register arranged to receive and store instructions
- An instruction decoder arranged to decode instructions stored in the instruction register
- A number of data paths controlled by the instruction decoder.
The local queue is preferably arranged to pause the reading of instructions until the data path is ready to receive and execute another instruction. This will optimize the queue management in the vector execution unit and the overall handling of instructions in the processor to which the vector execution unit belongs.
Said queue control means preferably comprises a queue controller arranged to hold status information related to the queue, such as how full the queue is, and to control the transfer of instructions from the local queue to the vector execution unit for execution. The queue controller can also be arranged to generate an error message if a new instruction is sent to the queue when the queue is full.
Said queue control means may be arranged to emit a specific signal which instructs the local queue to pause the reading of instructions from the local queue until a specific condition is met, for example that the data path is ready to accept a new instruction.
The vector controller is preferably arranged to cause a signal to be sent to a program control unit of the digital signal processor to indicate that the unit is ready to accept a new instruction. The transmission of this signal may be based on information, transmitted from the instruction decoder to the vector controller, about which instruction is being executed at any given time. The signal can also be based on the number of instructions currently in the queue, for example whether there is room for more instructions in the queue.
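The queue control behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not the patented hardware: the class name `LocalQueue`, the exception `QueueFullError`, the method names and the default depth of 4 are all assumptions introduced for the example.

```python
from collections import deque


class QueueFullError(Exception):
    """Raised when an instruction is issued to a local queue that is full."""


class LocalQueue:
    """Bounded FIFO holding instructions for one vector execution unit.

    All names and the default depth are hypothetical; they only mirror the
    roles described in the text (status information, error on overflow,
    ready indication, paused read-out).
    """

    def __init__(self, depth=4):
        self.depth = depth          # assumed queue length
        self._fifo = deque()

    def fill_level(self):
        """Status information: how many instructions are queued."""
        return len(self._fifo)

    def has_room(self):
        """Basis for a 'ready to accept a new instruction' indication."""
        return len(self._fifo) < self.depth

    def push(self, instruction):
        """Accept an instruction from the program memory."""
        if not self.has_room():
            raise QueueFullError("instruction issued to a full local queue")
        self._fifo.append(instruction)

    def pop_if_ready(self, datapath_ready):
        """Hand over the oldest instruction only when the data path is ready.

        Returns None while read-out is paused or the queue is empty.
        """
        if datapath_ready and self._fifo:
            return self._fifo.popleft()
        return None
```

In this sketch the core would check `has_room()` before issuing, and the unit would call `pop_if_ready(False)` for as long as a vector operation occupies the data path, which leaves the queued instructions untouched.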
The invention also relates to a digital signal processor comprising:
- a processor core comprising an integer execution unit configured to execute integer instructions; and
- at least a first and a second vector execution unit separate from and coupled to the processor core, each vector execution unit being a vector execution unit as above;
wherein the digital signal processor comprises a program memory arranged to hold instructions for the first and second vector execution units and issue logic for issuing instructions, including vector instructions, to the first and second vector execution units.
Such a digital signal processor enables more simultaneous, or parallel, use of its vector execution units.
The program memory is typically arranged in the processor core and is also arranged to contain instructions for the integer execution unit.
The invention also relates to a baseband communication device suitable for wired and wireless communication, comprising:
- A front-end unit configured to transmit and receive communication signals;
- A programmable digital signal processor connected to the analog front-end unit, the programmable digital signal processor being a digital signal processor as above.
In a preferred embodiment, the vector execution units referred to throughout this document are SIMD-type vector execution units or programmable auxiliary processors arranged to operate on vectors of data.
The local queue can be a FIFO (First In First Out) queue of the desired length, for example 4 to 8 instructions. It can also be any other type of suitable queue.
The processor according to embodiments of this invention is particularly useful for digital signal processors, especially baseband processors. The front-end unit may be an analog front-end unit arranged to transmit and/or receive radio frequency or baseband signals.
Such processors are often used in various types of communication devices, such as mobile phones, TV receivers and cable modems. The baseband communication device can consequently be arranged for communication in a cellular communication network, for example as a mobile telephone or a mobile data communication device. The baseband communication device can also be arranged for communication according to other wireless standards, such as Bluetooth or WiFi. It can also be a television receiver, a cable modem, a WiFi modem or any other type of communication device capable of delivering a baseband signal to its processor. It should be understood that the term "baseband" refers only to the signal handled internally in the processor. The communication signals actually received and/or transmitted may be any suitable type of communication signals, received on wireless or wired connections. The communication signals are suitably converted into a baseband signal by a front-end unit of the device.
Brief Description of the Drawings
The invention will in the following be described in more detail, by way of example and with reference to the accompanying drawings.
FIG. 1 is a system overview of a typical mobile terminal comprising a baseband processor.
FIG. 2 shows an example of the SIMT architecture.
FIG. 3 is a block diagram of the baseband processor according to an embodiment of the invention.
FIG. 4 is a diagram showing instruction issue pipelines of one embodiment of the processor core of FIG. 2.
FIG. 5 shows the instruction issue logic in SIMT processors.
FIG. 6 shows a SIMT unit according to prior art.
FIG. 7 shows a SIMT unit having the added features of a general embodiment according to the invention.
FIG. 8 shows a SIMT device according to a preferred embodiment of the invention.
FIG. 9 shows the working principle of the local queue according to an embodiment of the invention.
Detailed Description of Embodiments
FIG. 1 shows an example of a mobile terminal 1 comprising a baseband processor 3, which will be the main subject of this application. As is known in the art, the terminal 1 comprises means for receiving and transmitting communication signals. In this example, it comprises antennas 5 connected to an analog front-end unit 7, comprising an analog-to-digital converter ADC for the reception direction and a digital-to-analog converter DAC for the transmission direction. The analog front-end unit 7 is connected to the baseband processor 3.
The baseband processor 3 normally, but not necessarily, includes an FEC (Forward Error Correction) processor 9 for error correction functions such as interleaving, Viterbi decoding, etc., as is known in the art. The baseband processor 3 is typically in turn connected to a MAC unit 11, which in turn is connected to an application processor 13.
Typically, but not necessarily, the terminal 1 has a bus and memory subsystem that interconnects the baseband processor, the MAC unit 11 and the application processor 13. The terminal also includes peripheral interfaces 17 for user input/output, typically including a keyboard, a camera interface and interfaces for connections to other devices, such as a USB interface.
As those skilled in the art would appreciate, said analog front end may be arranged to handle any type of incoming and outgoing signals including radio frequency signals, baseband signals and others and to provide a baseband signal to the baseband processor 3.
FIG. 2 shows an example of a baseband processor 200 according to the SIMT architecture.
The processor 200 includes a control core 201 and a first 203 and a second 205 vector execution unit, which will be discussed in more detail in the following. An FEC unit 206, as discussed in connection with FIG. 1, is connected to the on-chip network. In a concrete implementation, the FEC unit 206 may of course include several different units.
A host interface unit 207 provides the connection to the host processor shown in FIG. 1 (not shown in FIG. 2). If a MAC processor is present, as shown in FIG. 1, the MAC processor is connected between the host interface unit 207 and the host processor. A digital front-end unit 209 provides the connection to the ADC/DAC unit shown in FIG. 1 in a manner well known in the art.
As is known in the art, the control core 201 comprises a program memory 211 as well as instruction issue logic and functions for multi-context support. For each supported execution context, or thread, it includes a program counter, a stack pointer and a register file (not specifically shown in FIG. 2). Typically 2-3 threads are supported.
The control core 201 also includes an integer execution unit 212 including a register file RF, a core integer memory ICM, a multiplier unit MUL and an arithmetic and logic/shift unit (ALSU). The ALSU can also be implemented as two units: an arithmetic unit and a logic and shift unit. These units are known in the art and are not shown in FIG. 2.
The first vector execution unit 203 in this example is a CMAC vector execution unit, comprising a vector controller 213, a vector load/store unit 215 and a number of data paths 217. The vector controller of this first vector execution unit is connected to the program memory 211 of the control core 201 to receive issue signals related to instructions from the program memory. In the description above, the issue logic decodes the instruction word to obtain the issue signal and sends this issue signal to the vector execution unit as a separate signal. It would also be possible to have the vector controller of the vector execution unit generate the issue signal locally. In this case, the issue signals are formed by the vector controller based on the instruction word in the same way as they would be in the issue logic.
A second vector execution unit 205 is a CALU vector execution unit comprising a vector controller 223, a vector load/store unit 225 and a plurality of data paths 227. The vector controller 223 of this second vector execution unit is also connected to the program memory 211 of the control core to receive instructions from the program memory. The operation of the data paths 217, 227 and the vector load/store units 215, 225 will be discussed below.
Any number of vector execution units may be present, including CMAC units only, CALU units only, or an appropriate number of each type.
There may also be types of vector execution units other than CMAC and CALU. As explained above, a vector execution unit is a processor that can process vector instructions, which means that a single instruction performs the same function on a number of data units. Data can be complex or real, and are grouped into bytes or words and packed into a vector to be operated on by a vector execution unit. In this document, CALU and CMAC units are used as examples, but it should be noted that vector execution units can be used to perform any suitable function on vectors of data.
To enable several simultaneous vector operations, the processor preferably has a distributed memory system in which the memory is divided into several memory banks, represented in FIG. 2 by Memory bank 0 230 to Memory bank N 231. Each memory bank 230, 231 has its own complex memory 232, 233 and address generation unit (AGU) 234 and 235, respectively. This arrangement, in conjunction with the on-chip network, improves the power efficiency of the memory system and the throughput of the processor because multiple address calculations can be performed in parallel. The PBBP shown in FIG. 2 also includes one or more integer memory banks 238, including a memory 239 and an address generation unit.
As is known in the art, a number of accelerators 242 are typically connected, as they enable efficient implementation of certain baseband functions, such as channel coding and interleaving. Such accelerators are well known in the art and will not be discussed in detail here. The accelerators can be configurable so that they can be reused across many different standards.
An on-chip network 244 connects the control core 201, the digital front-end unit 209, the host interface unit 207, the vector execution units 203, 205, the memory banks 230, 231, the integer bank 238 and the accelerators 242. Each vector execution unit 203, 205 comprises a vector load/store unit 215, 225 arranged to act as an interface between the network port and the data path in the vector execution unit. The execution units 203, 205 are typically connected to the memory banks 230, 231 via the network 244, but there may also be support for connections to other units, such as the accelerators 242 and other vector execution units. The load function is used to retrieve data from the other units connected to the network 244 (for example a memory bank), and the store function is used to store data from the execution units 203, 205 to, e.g., a memory bank 230, 231 via the network 244. Data can also be obtained from other vector execution units, and/or calculation results can be fed to other vector execution units for further processing. Each vector execution unit also includes a vector controller 213, 223 arranged to receive instructions from the program memory PM 211.
The vector load units 215, 225 can load data in two different modes. In the first mode, several data elements can be loaded from a memory bank 230, 231 or from other sources, as discussed above. In the second mode, data can be loaded one data element at a time and then distributed to the SIMD data paths in a specific execution unit. The latter mode can be used to reduce the number of memory accesses when data is processed in the execution unit.
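One possible reading of the two load modes is sketched below in Python. The function names `load_vector` and `load_broadcast`, the flat list used as a memory bank and the lane count of four are assumptions made for the illustration; the text does not specify how the hardware distributes a single element, so the copy-to-every-lane behaviour is an interpretation.

```python
def load_vector(memory, start, lanes=4):
    """First mode (as read here): fetch one element per data path in one access."""
    return [memory[start + i] for i in range(lanes)]


def load_broadcast(memory, index, lanes=4):
    """Second mode (as read here): fetch a single element and hand a copy
    of it to every data path, so only one memory access is needed."""
    return [memory[index]] * lanes


# A hypothetical memory bank holding complex samples.
bank = [1 + 2j, 3 + 0j, 5 + 0j, 7 + 0j, 9 + 0j]
per_lane = load_vector(bank, 1)       # four different elements
same_in_all = load_broadcast(bank, 0)  # the same element in all four lanes
```

The second function shows why such a mode can save memory accesses: a value that every data path needs, such as a common scaling factor, is read once rather than once per lane.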
In the illustrated embodiment, the second vector execution unit 205 is shown as a four-way complex ALU which may include four independent data paths 227, each of which has a complex multiplier-accumulator (CSMAC) as is common in the art. As will be described in more detail below, CALU 205 can execute vector instructions. In one embodiment, CALU 205 may be particularly suitable for executing complex vector instructions. Furthermore, each of the independent data paths 227 of CALU 205 can simultaneously execute the complex vector instructions.
The first vector execution unit 203 is shown as a four-way CMAC with four complex data paths that can be run simultaneously or separately. The four complex data paths include multipliers, adders and accumulator registers (none of which are shown in FIG. 2). Accordingly, in this embodiment, CMAC 203 may be referred to as a four-way CMAC data path. In addition to multiplication and addition, CMAC 203 can also perform rounding and scaling operations and support saturation, as is known in the art.
In one mode of operation, operations of the CMAC 203 can be divided into multiple pipeline steps. In addition, each of the four complex data paths 217 can calculate a complex multiplication and accumulation in one clock cycle. CMAC 203 (i.e., the four data paths together) can execute an operation on a vector with N elements in N/4 clock cycles to support complex vector computation (i.e., complex vector convolution, conjugated complex convolution and complex scalar vector product). CMAC 203 can further support operations on complex values stored in the accumulator registers (e.g., complex addition, subtraction, conjugation, etc.). For example, CMAC 203 can calculate a complex multiplication such as (AR + jAI) * (BR + jBI) in one clock cycle and a complex accumulation in one clock cycle, and support complex vector computation (e.g., complex convolution, conjugated complex convolution and complex scalar vector product).
In one embodiment, the instruction set architecture for the processor core 201 may include three classes of composite instructions. The first class of instructions are RISC instructions, which operate on 16-bit operands. The RISC instruction class includes most of the control-oriented instructions and can be executed within the integer execution unit 212 of the processor core 201. The next class of instructions are DSP instructions, which operate on complex-valued data having a real part and an imaginary part. The DSP instructions can be executed on one or more of the vector execution units 203, 205. The third class of instructions are the vector instructions. Vector instructions can be considered extensions of the DSP instructions, as they operate on large amounts of data and can take advantage of advanced addressing modes and vector support. The vector instructions can operate on complex and real data types.
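Returning to the four-way CMAC described above, the following Python sketch shows the complex multiplication (AR + jAI) * (BR + jBI) written out in real arithmetic, and a scalar product spread over four data paths so that N elements take N/4 cycles (rounded up). The function names, the tuple representation of complex values and the decision not to count the final cross-lane summation as a cycle are assumptions of this sketch; rounding, scaling and saturation are not modelled.

```python
import math


def cmul(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi), returned as (real, imag)."""
    return ar * br - ai * bi, ar * bi + ai * br


def cmac_dot(a, b, lanes=4):
    """Complex scalar product of two equal-length vectors.

    a, b: lists of (real, imag) tuples.
    Returns ((real, imag), cycles), where cycles = ceil(N / lanes).
    """
    acc = [(0, 0)] * lanes                     # one accumulator per data path
    cycles = math.ceil(len(a) / lanes)
    for cycle in range(cycles):
        for lane in range(lanes):              # the lanes work in parallel in hardware
            i = cycle * lanes + lane
            if i < len(a):
                pr, pi = cmul(a[i][0], a[i][1], b[i][0], b[i][1])
                acc[lane] = (acc[lane][0] + pr, acc[lane][1] + pi)
    # Final reduction of the per-lane accumulators (not counted as a cycle here).
    total = (sum(x[0] for x in acc), sum(x[1] for x in acc))
    return total, cycles
```

Under these assumptions a 512-element operation occupies the unit for 128 cycles, which is the kind of multi-cycle latency that makes the issue of the following instruction a scheduling question.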
FIG. 3 is a block diagram of the baseband processor PBBP 200 according to an embodiment of the invention. PBBP 200 includes a processor core, which includes a RISC-type execution unit and which is represented by the RISC data path 510.
The PBBP further has a number of vector execution units 520, 530, each of which comprises a vector control unit 275 and a SIMD data path 525 and 535, respectively. As is known in the art, each data path 525, 535 may include several data paths. For example, the data path 525 typically has four parallel CMAC data paths, which together form the data path 525.
To provide control of the multiple vector execution units, the core hardware 500 includes a program flow controller 501 coupled to a program counter 502, which in turn is connected to a program memory (PM) 503. PM 503 is coupled to a multiplexer 504 and to unit-field extraction 508.
The multiplexer 504 is connected to an instruction register 505, which is connected to an instruction decoder 506. The instruction decoder 506 is further connected to a control signal register (CSR) 507, which in turn is connected to the remainder of the RISC data path.
Similarly, each of the vector execution units 520 and 530 is also arranged to receive instructions from the program memory 503 located in the core.
The vector execution units include the respective vector length registers 521, 531, instruction registers 522, 532, instruction decoders 523, 533 and CSRs 524, 534, which are connected to their respective data paths 525 and 535. These units and their functions will be discussed in more detail, insofar as they are of importance for the invention, in connection with FIG. 5.
FIG. 4 shows a prior-art example of issuing instructions from the program memory to the various execution units, intended to illustrate the problem underlying the invention. The left column of FIG. 4 indicates time (in execution clock cycles). The other columns show, from left to right, the execution pipelines of a first and a second execution unit (more specifically the data paths of CMAC 203 and CALU 205) and of the integer execution unit, and the issuing of instructions to them. More specifically, in the first clock cycle a complex vector instruction (e.g., CMAC.256) is issued to CMAC 203. As shown, the vector instruction needs many cycles to complete. In the next clock cycle a vector instruction is issued to CALU 205. In the following clock cycle an integer instruction is issued to the integer execution unit 510. In subsequent cycles, while the vector instructions are being executed, any number of integer instructions can be issued to the integer execution unit 510. The remaining vector execution units can also execute instructions simultaneously in a corresponding manner.
In some cases, an "idle instruction" may be included in the instruction sequence to stop the core program flow controller from fetching instructions from the program memory. For example, to synchronize the program flow with the end of a vector instruction, the "idle instruction" can be used to suspend the fetching of instructions until a certain condition is met. This condition will typically be that the vector execution unit in question has completed a previous vector instruction and is able to receive a new instruction. In this case, the vector controller 275 of the vector execution unit 520, 530 in question will send an indication, such as a flag, to the program flow controller 501, indicating that the vector execution unit is ready to receive another instruction.
Idle instructions can be used for more than one vector execution unit at the same time. In this case, no further instructions can be sent from the program memory 503 until each vector execution unit 520, 530 in question has sent a flag indicating that it is ready to receive a new instruction.
In the example of FIG. 4, the "idle instruction" is issued after the above-mentioned integer instructions. The idle instruction is used in this example to halt the control flow until the vector operation performed by CMAC 203 is completed.
The following examples will be discussed on the basis of a SIMT DSP with an arbitrary number of execution units. For simplicity, it is assumed that all units in this example are CMAC execution units, but in practice, units of different types will be mixed and used together.
In many baseband processing algorithms and programs, the algorithm can be divided into a number of DSP tasks, each of which consists of a “prologue”, a vector operation and an “epilogue”. The prologue is mainly used to clear accumulators, set up address modes and pointers and the like before the vector operation can be performed. When the vector operation is completed, the result of the vector operation can be further processed by code in the "epilogue" part of the task. In SIMT processors, typically only one vector instruction is needed to perform the vector operation.
The typical layout of a DSP task is exemplified by the following example task according to the prior art. The code in the example performs a complex scalar product calculation over 512 complex values and then stores the result in memory. The routine requires that the following instructions be fetched by the processor core.
    .cmac0                      ; Assume cmac0 is selected
    prolog:                     ; Address setup
        ldi #0, r0
        out r0, cdm0_addr
        out r0, cdm1_addr
        out r0, cdm2_addr
        setcmvl.512             ; Set vector length to 512
    vectorop:
        cmac [0], [1], [2]      ; Perform cmac operation over 512 samples
        idle #cmac0             ; Stop program fetching until cmac0 is ready
    epilog:
        star [3]                ; Store accumulator
In the example above, the instructions setcmvl, cmac and star are issued to and executed on the CMAC vector execution unit, while the instructions ldi, out and idle are executed on the integer core.
The vector length of a vector instruction indicates how many data words (samples) the vector execution unit is to operate on. The vector length can be set in any suitable way, for example one of the following:
1) By dedicated instructions of the form setcmvl.123, as in the example above
2) Carried in the instruction itself, for example in the format cmac.123, as shown in FIG. 4
3) Set through a control register, for example in the format out r0, cmac_vector_length
The instruction idle #cmac0 instructs the core program flow controller to stop fetching new instructions until the CMAC0 unit has completed its vector operation. Once the idle instruction has been released and new instructions can be fetched again, the star instruction is fetched and issued to the CMAC0 vector execution unit. The star instruction instructs the CMAC vector execution unit to store the accumulator in memory.
In the next example, which also illustrates the prior art, two vector execution units are used. The instruction sequence assigned to the first vector execution unit is the same as above:
    .cmac0                      ; Assume cmac0 is selected
    prolog:                     ; Address setup
        ldi #0, r0
        out r0, cdm0_addr
        out r0, cdm1_addr
        out r0, cdm2_addr
        setcmvl.512             ; Set vector length to 512
    vectorop:
        cmac [0], [1], [2]      ; Perform cmac operation over 512 samples
        idle #cmac0             ; Stop program fetching until cmac0 is ready
    epilog:
        star [3]                ; Store accumulator
The instruction sequence assigned to the second vector execution unit is:
    .cmac1                      ; Assume cmac1 is selected
    prolog:                     ; Address setup
        ldi #0, r0
        out r0, cdm3_addr
        out r0, cdm4_addr
        out r0, cdm5_addr
        setcmvl.2048            ; Set vector length to 2048
    vectorop:
        cmac [0], [1], [2]      ; Perform cmac operation over 2048 samples
        idle #cmac1             ; Stop program fetching until cmac1 is ready
    epilog:
        star [3]                ; Store accumulator
In this case, the second vector execution unit is instructed to perform a vector operation of length 2048, which will take four times as long as the operation of length 512 in the first vector execution unit. The first vector execution unit will therefore finish before the second vector execution unit. Since the program memory is instructed by the idle #cmac1 instruction to wait with the next instruction until the second vector execution unit is finished, it will also not send a new instruction to the first vector execution unit until the second vector execution unit is finished. The first vector execution unit will therefore be inactive for more than 1000 clock cycles because of the idle instruction related to the second vector execution unit.
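The blocking effect described above can be reproduced with a small timing model in Python. This is a simplified sketch, not a cycle-accurate model of the processor: it assumes a single shared instruction stream that issues one instruction per cycle, a vector operation that processes one element per clock cycle by default (an assumption chosen because it is consistent with the "more than 1000 clock cycles" figure; the `lanes` parameter scales this), and a hypothetical program order in which both vector operations are started before the first idle instruction. The function name and the tuple encoding are invented for the example.

```python
def run_prior_art(program, lanes=1):
    """Issue instructions in order from one shared stream.

    program: list of (op, unit, length) tuples, op in {"vec", "idle", "star"}.
    Returns {unit: cycles the unit sat finished while waiting for its 'star'}.
    """
    busy_until = {}   # unit -> cycle at which its vector operation finishes
    waited = {}
    cycle = 0
    for op, unit, length in program:
        if op == "vec":
            busy_until[unit] = cycle + -(-length // lanes)   # ceiling division
        elif op == "idle":
            # Fetching stalls for every unit until this unit reports ready.
            # Only the wait itself is modelled; idle takes no extra cycle.
            cycle = max(cycle, busy_until.get(unit, 0))
            continue
        elif op == "star":
            waited[unit] = cycle - busy_until[unit]
        cycle += 1
    return waited


# Hypothetical ordering of the two tasks in the shared program.
PROGRAM = [
    ("vec", 0, 512),     # cmac on cmac0
    ("vec", 1, 2048),    # cmac on cmac1
    ("idle", 1, 0),      # idle #cmac1
    ("star", 1, 0),
    ("idle", 0, 0),      # idle #cmac0
    ("star", 0, 0),
]

waited = run_prior_art(PROGRAM)
```

With these assumptions `waited[0]` is 1538 cycles while `waited[1]` is 0: the first unit finishes early but cannot receive its star instruction until the idle on the second unit is released.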
In the example above, two vector execution units are used. As will be appreciated, the problem grows with the number of vector execution units, since an idle instruction related to one particular vector execution unit can affect a larger number of other vector execution units.
According to the invention, this problem is reduced by providing a local queue for each vector execution unit. The local queue is arranged to receive, from the program memory in the processor core, one or more instructions for its vector execution unit that are to be executed in sequence, and to forward one instruction at a time to the vector execution unit for execution. At the same time, a command is introduced that instructs the local queue to hold the next instruction until a certain condition is met. The condition may be, for example, that the vector execution unit has finished the previous command or that the data path is ready to receive a new instruction. For simplicity, this new command is referred to in this document as SYNC. The condition can be specified in the instruction word of the SYNC instruction, or it can be read from a control register or from another source.
An example of a sequence of instructions using the SYNC command is given as follows:
    .cmac0                    ; Select cmac0 as destination for cmac related instructions
                              ; Address setup
    ldi #0, r0
    out r0, cdm0_addr
    out r0, cdm1_addr
    out r0, cdm2_addr
    setcmvl.512               ; Set vector length to 512
    cmac [0], [1], [2]        ; Perform cmac operation over 512 samples
    sync                      ; Stop program queue until cmac is ready
    star [3]                  ; Store accumulator
    .cmac1                    ; Select cmac1 as destination for cmac related instructions
                              ; Address setup
    ldi #0, r0
    out r0, cdm3_addr
    out r0, cdm4_addr
    out r0, cdm5_addr
    setcmvl.2048              ; Set vector length to 2048
    cmac [0], [1], [2]        ; Perform cmac operation over 2048 samples
    sync                      ; Stop program queue until cmac is ready
    star [3]                  ; Store accumulator
In contrast to the prior art, all of these commands can be sent at once to the local queue of the vector execution unit in question and stored there while waiting to be passed, one command at a time, to the instruction decoder within the vector execution unit. As explained above, the sync command is provided to stop the local queue until the vector execution unit has finished the cmac command, which is a vector instruction and therefore takes several clock cycles to complete.
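The effect of the per-unit queue can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not the patented logic: the class and function names are hypothetical, the address setup instructions are left out, each queued instruction is taken to issue in one cycle, and a cmac is taken to occupy the data path for one cycle per sample.

```python
from collections import deque


class VectorUnit:
    """Toy vector execution unit with a local instruction queue."""

    def __init__(self, program):
        self.queue = deque(program)  # local queue, filled up front by the core
        self.vector_length = 1
        self.busy = 0                # cycles left on the running vector operation
        self.finished_at = None

    def step(self, cycle):
        if self.finished_at is not None:
            return
        if self.busy > 0:
            self.busy -= 1           # assumed: one sample per cycle
        if not self.queue:
            if self.busy == 0:
                self.finished_at = cycle
            return
        op, *args = self.queue[0]
        if op == "sync" and self.busy > 0:
            return                   # hold the queue until the vector op is done
        self.queue.popleft()
        if op == "setcmvl":
            self.vector_length = args[0]
        elif op == "cmac":
            self.busy = self.vector_length


def program(vector_length):
    """Shortened version of the listing above (no address setup)."""
    return [("setcmvl", vector_length), ("cmac",), ("sync",), ("star",)]


def run(units):
    """Step all units in lockstep and return the cycle at which each one finished."""
    cycle = 0
    while any(u.finished_at is None for u in units):
        for u in units:
            u.step(cycle)
        cycle += 1
    return [u.finished_at for u in units]


print(run([VectorUnit(program(512)), VectorUnit(program(2048))]))
```

In this model each sync only holds its own queue, so the unit working on 512 samples completes its star long before the unit working on 2048 samples, instead of waiting for it.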
FIG. 5 illustrates the vector instruction logic of a prior art baseband processor 700 which may be used as a starting point for the present invention.
The baseband processor includes a RISC core 701 having a program memory PM 702, which stores instructions for the various execution units of the processor, and a RISC program flow controller 703. From the program memory 702, instructions are fetched to an issue logic unit 705, which is common to all execution units and controls where each specific instruction is to be sent. The issue logic 705 corresponds to the unit field extraction unit 508 and the issue control 509 in FIG. 3. The issue logic is in this case connected to a number of vector execution units 710, 712, 714 and, via a multiplexer 715, to a RISC data path unit 716, the latter being part of the RISC core and corresponding to the units 505, 506, 507 and 510 in FIG. 3. As explained above, in one embodiment the instruction words that contain the actual instructions are sent to all execution units, while the issue signal corresponding to a particular instruction is sent only to the execution unit that is to execute this instruction. In an alternative embodiment, the issue signal is handled locally by each vector execution unit.
FIG. 6 illustrates a vector execution unit 710, which may be one of the vector execution units 710, 712, 714 in FIG. 5, according to the prior art. The vector execution unit 710 has a vector controller 720, a vector length counter 721, an instruction register 722 and an instruction decoding unit 723. As shown in FIG. 5, the vector execution unit 710 receives instructions from the program memory 702; this is shown in simplified form in FIG. 6. The instruction word is the actual instruction and is received in the instruction register 722 and forwarded to the instruction decoder 723.
The issue signal is received in the vector controller via the issue logic unit 705 and is used to control the execution of the instruction word. If the issue signal is active, the instruction is loaded into the instruction register, decoded and executed; otherwise it is discarded. The vector controller 720 also handles the vector length counter 721 and other control signals used in the system, which are discussed below.
Traditionally, during each clock cycle, an instruction for one of the execution units can be fetched from the program memory 702. The unit field can be extracted from the instruction word and used to control to which execution unit the instruction is issued. For example, if the unit field is "000", the instruction can be issued to the RISC data path. This may cause the issue logic 705 to allow the instruction word to pass through the multiplexer 715 and into the RISC core 716 (not shown in FIG. 6), while no new instructions are loaded into the vector execution units during this cycle. However, if the unit field contains any other value, the issue logic 705 may enable the corresponding instruction issue signal to the vector execution unit for which the instruction is intended. The vector controller 720 in the selected vector execution unit then allows the instruction word to pass through and into the instruction register 722 of said vector execution unit. In this case, a NOP instruction is sent to the RISC data path instruction register in the RISC core 716.
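The routing decision described above can be sketched as a small function. The instruction width, the position of the unit field in the top bits, the NOP encoding and the mapping of non-zero unit values to vector unit indices are all assumptions made for illustration; the text only states that the value "000" selects the RISC data path.

```python
NOP = 0               # hypothetical NOP encoding
WORD_BITS = 24        # hypothetical instruction width
UNIT_FIELD_BITS = 3   # matches the three-bit example value "000"


def issue(word, num_vector_units):
    """Return (word for the RISC data path, issue signal per vector unit)."""
    unit_field = word >> (WORD_BITS - UNIT_FIELD_BITS)
    signals = [False] * num_vector_units
    if unit_field == 0:
        # RISC instruction: passes the multiplexer, no vector unit is enabled.
        return word, signals
    index = unit_field - 1  # assumed mapping: unit field 1 selects the first vector unit
    if index >= num_vector_units:
        raise ValueError(f"no vector unit for unit field {unit_field:03b}")
    signals[index] = True
    # The RISC data path receives a NOP while a vector unit takes the word.
    return NOP, signals
```

For example, a word whose top three bits are 010 would raise the issue signal of the second vector unit only, and the RISC data path would see a NOP in that cycle.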
To handle vector instructions, when an instruction is issued to a vector execution unit, the vector length field can be extracted from the instruction word and stored in the count register 721. This count register can be used to keep track of the vector length of the corresponding vector instruction and of when to send the flag indicating that the vector execution unit is ready to receive another instruction. When the vector execution unit has completed the vector operation, the vector controller 720 may cause a signal (flag) to be sent to the program flow control 703 (not shown in FIG. 6) to indicate that the unit is ready to accept a new instruction. In addition, the vector controller 720 of each vector execution unit 520, 530 (see FIG. 3) may generate control signals for the prologue and epilogue states within the execution unit. Such control signals can control the VLU and VSU for vector operations and also handle, e.g., odd vector lengths.
When the issue logic 705 determines, by decoding the unit field, that a particular instruction is to be sent to a particular vector execution unit, the instruction word from the program memory 702 is loaded into the instruction register 722. If the instruction is determined (by the vector controller) to carry a vector length, the count register 721 is loaded with that vector length value. The vector controller 720 decodes portions of the instruction word to determine whether the instruction is a vector instruction and carries vector length information. If so, the vector controller 720 activates a signal for the count register 721 to load a value indicating the vector length. The vector controller 720 also instructs the instruction decoder 723 to begin decoding the instruction and to start sending control signals to the data path 724. The instruction in the instruction register 722 is then decoded by the instruction decoder 723, whose control signals are held in the control signal register 724 before they are sent to the data path. The count register 721 keeps track of the number of times the instruction is to be repeated, i.e. the vector length, in a conventional manner.
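The role of the count register can be shown with a minimal counter. The class and method names are hypothetical, and the sketch assumes that one vector element is processed per call to tick; it only illustrates loading the vector length and raising a ready flag when the count reaches zero.

```python
class VectorLengthCounter:
    """Minimal stand-in for the count register and its ready flag."""

    def __init__(self):
        self.count = 0

    def load(self, vector_length):
        # Done by the vector controller when a vector instruction is issued.
        self.count = vector_length

    def tick(self):
        """Process one element; return True once the unit is ready again."""
        if self.count > 0:
            self.count -= 1
        return self.count == 0


counter = VectorLengthCounter()
counter.load(3)
flags = [counter.tick() for _ in range(3)]
print(flags)
```

The ready flag stays low for the first two elements and goes high after the third, which is the point at which a flag could be sent to the program flow control.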
FIG. 7 illustrates a vector execution unit 810 according to the invention. The vector execution unit comprises all elements of the prior art vector execution unit shown in FIG. 6, provided with the same reference numerals.
In addition, the vector execution unit according to the invention has a local queue 730 arranged to hold a number of instructions received from the program memory. A queue controller 732 arranged to control the local queue 730 is arranged in the vector controller 720. The queue 730 and the queue controller 732 are connected to each other to exchange information and commands. For example, the queue controller 732 may include a counter arranged to keep track of the number of instructions in the queue 730. Alternatively, the queue itself may keep track of its status and send information indicating that it is full or empty, or almost full or empty, to the queue controller 732. Accordingly, the queue controller 732 holds status information about the local queue 730 and can send control signals to start, stop or empty the local queue 730. The instruction decoder 723 is arranged to inform the vector controller 720 of which instruction is currently being executed.
As explained above, many DSP tasks are implemented as a sequence of instructions, such as a prologue, a vector instruction, and an epilogue. The vector instruction will run for a number of clock cycles, during which time no new instruction can be fetched. In this case, as explained above, the new SYNC instruction is used to cause the local queue to hold the next instruction until a certain condition is met. When the queue controller 732 is informed that the instruction decoder 723 has decoded a "sync" instruction, it sets a state or mode in the queue controller 732 which stops the local queue 730 until the condition is met. This is normally implemented using the remaining vector length information and information regarding the current instruction from the instruction decoder. Flags transmitted from the data path 724 to the queue controller 732 can also be used. The condition will typically be that the processing of the vector instruction is completed, so that the instruction decoder 723 in the vector execution unit is ready to process the next instruction.
The local queue 730 can be any type of queue suitable to contain the desired number of instructions. One type is a FIFO queue capable of containing an appropriate number, e.g. 8, instructions.
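A FIFO queue with the status information and hold behaviour described above might look as follows. This is a sketch only: the class, attribute and exception names are hypothetical, and raising an exception on overflow is just one possible way of reporting that an instruction was sent to a full queue.

```python
from collections import deque


class QueueFullError(Exception):
    """Raised when an instruction is sent to a queue that is already full."""


class LocalQueue:
    """FIFO instruction queue with the status flags a queue controller could use."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._items = deque()
        self.held = False  # set while a sync condition is outstanding

    @property
    def full(self):
        return len(self._items) >= self.capacity

    @property
    def empty(self):
        return not self._items

    def push(self, instruction):
        if self.full:
            raise QueueFullError("local queue is full")
        self._items.append(instruction)

    def pop(self):
        """Return the next instruction, or None while held or empty."""
        if self.held or self.empty:
            return None
        return self._items.popleft()
```

A caller would set held to True when a sync is decoded and clear it once the condition is met, at which point pop resumes delivering instructions in first-in-first-out order.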
FIG. 8 illustrates a vector execution unit 910 according to a preferred embodiment of the invention. The vector execution unit shown in FIG. 8 comprises the same units as in FIG. 7, interconnected in the same way. In this embodiment, however, the local queue 740 is a cyclic queue suitable for repeating a specified number of instructions. This is particularly advantageous in implementations where the same sequence of instructions is to be executed a large number of times, which can sometimes exceed 1000. In this case, a significant amount of bandwidth can be saved by not having to send the same instructions again from the core unit to the vector execution unit each time they are to be executed.
As in FIG. 7, there is a queue controller 732 arranged in the vector controller 720'. In the embodiment of FIG. 8, a buffer manager 744 is also provided to keep track of the instructions to be repeated and the number of times an instruction is to be repeated. For this purpose, there are two registers which are also controlled by the vector controller 720': a repetition register 746 for storing the number of repetitions of the instruction, and an instruction counter register 748 arranged to hold the number of instructions to be repeated. If all instructions issued to the vector execution unit pass through the queue 740, i.e. the cyclic buffer, the buffer will remember the last N (typically 8-16) instructions.
The repetition register 746 is configured to hold the number of repetitions to be performed. The repetition register 746 may be loaded through the control register file, from the instruction word issued to the vector execution unit, or by any other method.
The instruction counter register 748 is configured to hold the number indicating how many instructions in the cyclic buffer 740 are to be included in the iterative loop. The instruction counter register can be loaded through the control register, read from the instruction word issued to the vector execution unit, or set by any other method.
When a "repeat instruction" or an instruction with a "repeat" flag set is issued to the vector execution unit, the instruction decoder 723 in conjunction with the vector controller 720' instructs the queue controller 732 to issue instructions from the cyclic buffer 740 to the instruction register 722.
When, as in FIG. 7, a “sync” instruction is found by the instruction decoder 723, the instruction decoder instructs the queue controller 732 to stop retrieving instructions from the local, cyclic, queue until a predetermined condition is met.
This condition is typically that the previous instruction taken from the queue has completed, so that the decoder is ready to receive a new instruction.
Although the local queue 730, 740 and the instruction register 722 are shown as separate units in this document, it would be possible to combine them into one unit. The instruction register 722 could, for example, be integrated as the last element of the local queue.
The buffer manager 744 monitors the operation of the local buffer 740 and handles repetition of the instructions currently stored in the circular buffer, while the queue controller 732 handles starting and stopping the issue of instructions from the circular buffer 740.
The buffer manager 744 further handles the repetition register 746 and keeps track of how many repetitions have been performed. When the number of repetitions specified in the repetition register 746 has been performed, a signal is sent to the vector controller 720', which can then forward it to the program flow control 703 (not shown in FIG. 8) to indicate that the operation is completed.
When the requested number of iterations has been completed, the circular buffer 740 reverts to its ordinary queue functionality and stores the most recently issued instructions, so that a new repeat instruction can be started.
FIG. 9 illustrates the working principle of the local queue according to an embodiment of the invention. The queue itself is represented by a horizontal line 901. A first vertical arrow symbolizes the write pointer 903, which indicates the position in the queue at which a new instruction is currently being written. A corresponding horizontal arrow 905 indicates the direction in which the write pointer moves, to the right in the drawing.
A second vertical arrow symbolizes the read pointer 907, which indicates the position in the queue from which an instruction to be executed is currently being read. A corresponding horizontal arrow 909 indicates the direction in which the read pointer moves, which is the same direction as the write pointer 903. The distance between the write pointer 903 and the read pointer 907 is the current length of the queue, i.e. the number of instructions currently in the queue.
In the example of FIG. 9, a sequence of instructions to be repeated a number of times has been written to the queue. The beginning and the end of the sequence are indicated by a first 911 and a second 913 vertical line across the horizontal line 901. A backward arrow 915 indicates that when the read pointer 907 reaches the end of the sequence of instructions, indicated by the second vertical line 913, the read pointer jumps back to the beginning of the sequence, indicated by the first vertical line 911. This is repeated until the sequence of instructions has been executed the specified number of times. The control logic (not shown) is arranged to keep track of the number of instructions in the sequence to be iterated, and their position in the queue. This includes, for example:
- Position 911 for the beginning of the sequence of instructions to be repeated
- Position 913 for the end of the sequence of instructions to be repeated
- The number of times the sequence of instructions is to be repeated
Instead of both the beginning and the end of the sequence, the position of either the beginning or the end may be stored together with the length of the sequence, i.e. the number of instructions included in the sequence.
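The pointer scheme of FIG. 9 can be sketched as a cyclic buffer that replays the most recently written instructions. The names below are hypothetical, and the sketch takes the variant in which the end position and the sequence length are known, so that the start position is derived from them; how the hardware actually stores these values is not specified here.

```python
class CyclicQueue:
    """Cyclic instruction buffer with a write pointer and a read pointer."""

    def __init__(self, size=16):
        self.slots = [None] * size
        self.write = 0  # counts every instruction ever written
        self.read = 0

    def push(self, instruction):
        self.slots[self.write % len(self.slots)] = instruction
        self.write += 1

    def replay(self, count, repetitions):
        """Yield the last `count` instructions written, `repetitions` times."""
        if count > min(self.write, len(self.slots)):
            raise ValueError("sequence is longer than the stored history")
        end = self.write       # end of the sequence (cf. line 913)
        start = end - count    # beginning of the sequence (cf. line 911)
        for _ in range(repetitions):
            self.read = start  # jump back (cf. arrow 915)
            while self.read < end:
                yield self.slots[self.read % len(self.slots)]
                self.read += 1


queue = CyclicQueue(size=4)
for name in ["ldi", "out", "cmac", "star"]:
    queue.push(name)
print(list(queue.replay(count=2, repetitions=3)))
```

Here count plays the role of the instruction counter register 748 and repetitions that of the repetition register 746; the core sends the two instructions once and the buffer reissues them locally.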

Claims:
Claims (18)
[1]
A vector execution unit (203, 205, 520, 530) for use in a digital signal processor (200), the vector execution unit being arranged to execute instructions, including vector instructions which are to be executed on multiple data in the form of a vector, comprising a vector controller (720, 720') arranged to determine if an instruction is a vector instruction and, if it is, to inform a count register (721) arranged to hold the vector length, the vector controller (720, 720') further being arranged to control the execution of instructions, the vector execution unit being characterized in that - it comprises a local queue (730) arranged to receive at least a first and a second instruction from a program memory and to hold the second instruction in the local queue until a predetermined condition is met, and that - the vector controller (720, 720') comprises queue control means (732, 721) arranged to control the local queue.
[2]
The vector execution unit according to claim 1, further arranged to receive an issue signal and to control the execution of instructions based on this issue signal.
[3]
The vector execution unit according to claim 1 or 2, further comprising - an instruction register (722) arranged to receive and store instructions - an instruction decoder (723) arranged to decode instructions stored in the instruction register - a number of data paths controlled by the instruction decoder.
[4]
A vector execution unit according to any one of the preceding claims, wherein the local queue (730) is arranged to pause the reading of instructions until the data path is ready to receive and execute another instruction.
[5]
A vector execution unit according to any one of the preceding claims, wherein said queue control means (732) comprises a queue controller arranged to hold status information related to the queue, such as how full the queue (730) is, and to control the issue of instructions from the local queue (730) to the vector execution unit (203, 205, 520, 530) for execution.
[6]
The vector execution unit according to claim 5, wherein the queue controller is arranged to generate an error message if a new instruction is sent to the queue and the queue is full.
[7]
The vector execution unit according to claim 6, wherein said queue control means (732) is arranged to emit a specific signal instructing the local queue to pause the reading of instructions from the local queue until the condition is met.
[8]
The vector execution unit according to any one of the preceding claims, wherein the vector controller (720, 720') is arranged to cause a signal to be transmitted to a program control (703) of the digital signal processor to indicate that the unit is ready to accept a new instruction.
[9]
Vector execution unit according to one of the preceding claims, wherein the instruction decoder (723) is arranged to inform the vector controller (720, 720') of the instruction which is executed at any given time.
[10]
Vector execution unit according to any one of the preceding claims, wherein the local queue is a first-in-first-out queue.
[11]
Digital signal processor (200) comprising: - a processor core (201) comprising an integer execution unit (212) configured to execute integer instructions; and - at least a first and a second vector execution unit (203, 205, 520, 530) separate from and coupled to the processor core (201), each vector execution unit being a vector execution unit (203, 205) according to any one of the preceding claims; wherein the digital signal processor comprises a program memory (211) arranged to hold instructions for the first and second vector execution units (203, 205) and issue logic for issuing instructions, including vector instructions, to the first and second vector execution units.
[12]
The digital signal processor according to claim 11, wherein the program memory (211) is also arranged to hold instructions for the integer execution unit (212).
[13]
Digital signal processor according to any one of claims 11-12, wherein the program memory (211) is arranged in the processor core (201).
[14]
Baseband communication device suitable for wired and wireless location communication, comprising: - a front-end unit (7) configured to transmit and receive communication signals; - a programmable digital signal processor (3) coupled to the analog front end unit, the programmable digital signal processor being a digital signal processor according to any one of claims 9-12.
[15]
The baseband communication device according to claim 14, wherein the front end unit (7) is an analog front end unit arranged to transmit and/or receive radio frequency or baseband signals.
[16]
The baseband communication device according to claim 14 or 15, wherein the baseband communication device is a device for communication in a wireless communication network, such as a cellular communication network.
[17]
The baseband communication device of claim 14, wherein the baseband communication device is a television receiver.
[18]
The baseband communication device of claim 14, wherein the baseband communication device is a cable modem.

Similar technology:

Publication No. | Publication date | Title

US20140281373A1|2014-09-18|Digital signal processor and baseband communication device

SE1150966A1|2013-04-19|Digital signal processor and baseband communication device

WO2012151331A1|2012-11-08|Methods and apparatus for constant extension in a processor

WO2016100142A2|2016-06-23|Advanced processor architecture

EP2352082B1|2018-11-28|Data processing device for performing a plurality of calculation processes in parallel

WO2015114305A1|2015-08-06|A data processing apparatus and method for executing a vector scan instruction

US20140317383A1|2014-10-23|Apparatus and method for compressing instruction for vliw processor, and apparatus and method for fetching instruction

WO2012061416A1|2012-05-10|Methods and apparatus for a read, merge, and write register file

US10303399B2|2019-05-28|Data processing apparatus and method for controlling vector memory accesses

KR20140105805A|2014-09-02|Digital signal processor and baseband communication device

US20070226468A1|2007-09-27|Arrangements for controlling instruction and data flow in a multi-processor environment

CN110377339A|2019-10-25|Long-latency instruction processing unit, method and equipment, readable storage medium storing program for executing

US9557996B2|2017-01-31|Digital signal processor and method for addressing a memory in a digital signal processor

JP5786719B2|2015-09-30|vector processor

CN112074810A|2020-12-11|Parallel processing apparatus

CN104011674A|2014-08-27|Digital signal processor

JP2009140514A|2009-06-25|Semiconductor device

Patent family:

Publication No. | Publication date

CN103890718A|2014-06-25|

ES2688603T3|2018-11-05|

US20140244970A1|2014-08-28|

KR20140078717A|2014-06-25|

WO2013058695A1|2013-04-25|

CN103890718B|2016-08-24|

EP2751668A1|2014-07-09|

SE536462C2|2013-11-26|

EP2751668B1|2018-08-01|

Cited documents:

Publication No. | Filing date | Publication date | Applicant | Title

JPS6043535B2|1979-12-29|1985-09-28|Fujitsu Ltd|

JP2810068B2|1988-11-11|1998-10-15|株式会社日立製作所|Processor system, computer system, and instruction processing method|

US5179530A|1989-11-03|1993-01-12|Zoran Corporation|Architecture for integrated concurrent vector signal processor|

US6658556B1|1999-07-30|2003-12-02|International Business Machines Corporation|Hashing a target address for a memory access instruction in order to determine prior to execution which particular load/store unit processes the instruction|

US7231193B2|2004-04-13|2007-06-12|Skyworks Solutions, Inc.|Direct current offset correction systems and methods|

US7990949B2|2004-11-09|2011-08-02|Broadcom Corporation|Enhanced wide area network support via a broadband access gateway|

US7543119B2|2005-02-10|2009-06-02|Richard Edward Hessel|Vector processor|

US7415595B2|2005-05-24|2008-08-19|Coresonic Ab|Data processing without processor core intervention by chain of accelerators selectively coupled by programmable interconnect network and to memory|

US7299342B2|2005-05-24|2007-11-20|Coresonic Ab|Complex vector executing clustered SIMD micro-architecture DSP with accelerator coupled complex ALU paths each further including short multiplier/accumulator using two's complement|

US20070198815A1|2005-08-11|2007-08-23|Coresonic Ab|Programmable digital signal processor having a clustered SIMD microarchitecture including a complex short multiplier and an independent vector load unit|

US20130185538A1|2011-07-14|2013-07-18|Texas Instruments Incorporated|Processor with inter-processing path communication|

US20160226544A1|2015-02-04|2016-08-04|GM Global Technology Operations LLC|Adaptive wireless baseband interface|

EP3125108A1|2015-07-31|2017-02-01|ARM Limited|Vector processing using loops of dynamic vector length|

CN107315568B|2016-04-26|2020-08-07|中科寒武纪科技股份有限公司|Device for executing vector logic operation|

US10713045B2|2018-01-08|2020-07-14|Atlazo, Inc.|Compact arithmetic accelerator for data processing devices, systems and methods|

Legal status:
2021-06-01| NUG| Patent has lapsed|

Priority:

Application No. | Filing date | Title

SE1150966A| SE536462C2|2011-10-18|2011-10-18|Digital signal processor and baseband communication device|

PCT/SE2012/050979| WO2013058695A1|2011-10-18|2012-09-17|Digital signal processor and baseband communication device|

KR1020147011833A| KR20140078717A|2011-10-18|2012-09-17|Digital signal processor and baseband communication device|

US14/350,538| US20140244970A1|2011-10-18|2012-09-17|Digital signal processor and baseband communication device|

EP12784087.4A| EP2751668B1|2011-10-18|2012-09-17|Digital signal processor and baseband communication device|

CN201280051515.3A| CN103890718B|2011-10-18|2012-09-17|Digital signal processor and baseband communication equipment|

ES12784087.4T| ES2688603T3|2011-10-18|2012-09-17|Digital signal processor and baseband communication device|
